Abstract



The conditional gradient method is a long-studied first-order opti- 
mization method for smooth convex optimization. Its main appeal is 
the low computational complexity: the conditional gradient method 
requires only a single linear optimization operation per iteration. In 
contrast, other first-order methods require a projection computation 
every iteration, which is the computational bottleneck for many large 
scale optimization tasks. The drawback of the conditional gradient 
method is its relatively slow convergence rate. 

In this work we present a new conditional gradient algorithm for 
smooth and strongly convex optimization over polyhedral sets with 
a linear convergence rate - an exponential improvement over previous 
results. 

We extend the algorithm to the online and stochastic settings, and give 
the first conditional gradient algorithm that attains optimal regret 
bounds for both arbitrary convex losses and strongly convex losses. 
Our online and stochastic algorithms require a single linear optimiza- 
tion step over the domain per iteration. 

1 Introduction 

First-order optimization methods, such as gradient-descent methods [HI dSl
[T7] and conditional-gradient methods pi El lU El EE2] , are often the method of
choice for coping with very large scale optimization tasks. Although these
methods theoretically attain inferior convergence rates compared to other
efficient optimization algorithms, e.g. interior point methods [IB], modern
optimization problems are often so large that using second-order information
or other super-linear operations becomes practically infeasible.

The computational bottleneck of gradient descent methods is usually the 
computation of orthogonal projections onto the convex domain. This is also 
the case with proximal methods [T7] . 

A well-known remedy for the latter computational problem is the conditional
gradient method, also known as the Frank-Wolfe method. Conditional-gradient
methods are suitable for smooth optimization and do not rely on computing
projections; instead, each iteration consists of optimizing a single linear
objective over the convex domain. For many domains of interest, this task can
be carried out by a very efficient combinatorial algorithm.

Our first contribution in this paper is the design of a new conditional- 
gradient-based algorithm for smooth and strongly convex optimization. Un- 
like previous methods, our new algorithm attains a linear rate of convergence 
over polytopes, thus running in polynomial time for many combinatorial op- 
timization problems. 



Setting                               Previous          This paper

Offline, smooth and strongly convex   $t^{-1}$ [13]     $e^{-\Theta(t)}$
Stochastic, smooth losses             $t^{-1/2}$ [12]   $t^{-1/2}$
Stochastic, non-smooth losses         $t^{-1/4}$ [12]   $t^{-1/2}$
Online, arbitrary losses              $T^{3/4}$         $\sqrt{T}$
Online, strongly convex losses        $T^{3/4}$ [12]    $\log T$



Table 1: Comparison of conditional gradient methods for optimization over poly- 
topes in various settings. In the offline and stochastic settings we give the con- 
vergence rates, omitting constants. In the online setting we give the order of the 
regret after T rounds. 

Using this new linearly converging algorithm, we are able to design new
algorithms for stochastic and online convex optimization. Using the
conditional gradient method for solving stochastic and online problems is of
utmost importance in machine learning, due to the very large scale
optimization problems that arise. Recently, a conditional gradient algorithm
for stochastic and online optimization was given in [12], though its rate is
suboptimal. We give the first optimal-rate conditional gradient (CG)
algorithms for these problems, thereby resolving the question of [12]. Our
results are summarized in Table 1.



1.1 Related Work 

Offline optimization Conditional gradient methods for offline minimization
of convex and smooth functions date back to the work of Frank and Wolfe [6].
More recent works of Clarkson [I] and Hazan [8] consider the conditional
gradient method for optimization over the simplex and the semidefinite cone,
respectively. Despite its relatively slow convergence rate (an additive error
of order 1/t after t iterations), the method has two benefits: i) its
computational simplicity, since each iteration consists of optimizing a
linear objective over the set, and ii) it is known to produce sparse solutions
(for the simplex this means only a few non-zero entries; for the semidefinite
cone this means that the solution has low rank). The results of [U [8] are
generalized in [13], showing that the same rate of convergence is achievable
for arbitrary convex and compact sets as long as an oracle for optimizing a
linear objective over the set is available. The convergence rate 1/t is also
optimal for this method without further assumptions, as shown in [8, 13].

For the special problem of solving convex linear systems, that is, finding a
point in the intersection of an affine set with a compact convex set, [2]
gives a conditional gradient algorithm with linear convergence. However, a
Slater condition is needed: they assume that a solution exists that is far
enough from the boundary of the convex set. The recent work of [1] uses the
linear convergence result of [2] for the problem of learning in intractable
Markov random field models.

For the case in which the objective function is smooth and strongly convex,
an extension of the conditional-gradient algorithm with a linear convergence
rate was presented in [15]; however, their algorithms require solving a
regularized linear problem on each iteration, which is computationally
equivalent to computing projections. In case the convex set is a polytope, [7]
has shown that the algorithm of [B] converges at a linear rate, assuming that
the optimal point in the polytope is bounded away from the boundary. The
convergence rate is proportional to a quadratic function of the distance of
the optimal point from the boundary. In our work we do not require the latter
assumption, and the convergence rate does not depend on the optimal solution.

Online and Stochastic Optimization The two closest works to ours are
[2] and [12]; in both, no projections are used, and the only optimization
carried out by the algorithms on each iteration is minimizing a linear
objective over the decision set. [H] gives a randomized algorithm for the
online setting in the special case in which all loss functions are linear. In
this setting their algorithm achieves regret of $O(\sqrt{T})$, which is
optimal. On iteration t their algorithm plays a point in the decision set that
minimizes the cumulative loss on all previous iterations plus random noise.
The work of [12] introduces algorithms for stochastic and online optimization
which are based on ideas similar to ours: using the conditional gradient
update step and the convergence of the RTFL algorithm [9]. For stochastic
optimization, in the case that all loss functions are smooth, they achieve an
optimal convergence rate of $1/\sqrt{T}$; however, for non-smooth stochastic
optimization they only get a convergence rate of $T^{-1/4}$, and for the full
adversarial setting of online convex optimization they get suboptimal regret
that scales like $T^{3/4}$.

2 Preliminaries 

Given two vectors x, y we write $x \ge y$ if every entry of x is greater than
or equal to the corresponding entry in y. We denote by $\mathbb{B}_r(x)$ the
Euclidean ball of radius r centred at x. We denote by $\|x\|$ the $\ell_2$
norm of the vector x and by $\|A\|$ the spectral norm of the matrix A, that is
$\|A\| = \max_{\|x\| \le 1} \|Ax\|$. Given a matrix A we denote by A(i) the
vector that corresponds to the ith row of A.

Definition 1. We say that a function $f(x) : \mathbb{R}^n \to \mathbb{R}$ is
Lipschitz with parameter L over the set $\mathcal{K} \subset \mathbb{R}^n$ if
for all $x, y \in \mathcal{K}$ it holds that
\[ |f(x) - f(y)| \le L \|x - y\| . \]

Definition 2. We say that a function $f(x) : \mathbb{R}^n \to \mathbb{R}$ is
$\beta$-smooth over the set $\mathcal{K}$ if for all $x, y \in \mathcal{K}$ it
holds that
\[ f(y) \le f(x) + \nabla f(x)^\top (y - x) + \beta \|x - y\|^2 . \]

Definition 3. We say that a function $f(x) : \mathbb{R}^n \to \mathbb{R}$ is
$\sigma$-strongly convex over the set $\mathcal{K}$ if for all
$x, y \in \mathcal{K}$ it holds that
\[ f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \sigma \|x - y\|^2 . \]

The above definition together with first order optimality conditions imply
that for a $\sigma$-strongly convex f, if
$x^* = \arg\min_{x \in \mathcal{K}} f(x)$, then for all $x \in \mathcal{K}$
\[ f(x) - f(x^*) \ge \sigma \|x - x^*\|^2 . \]
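The step from Definition 3 to this bound can be spelled out as follows. This
is a short worked derivation, assuming the standard first-order optimality
condition $\nabla f(x^*)^\top (x - x^*) \ge 0$ for all $x \in \mathcal{K}$:

```latex
\begin{align*}
f(x) &\ge f(x^*) + \nabla f(x^*)^\top (x - x^*) + \sigma \|x - x^*\|^2
      && \text{(Definition 3 applied at the pair } (x^*, x)\text{)} \\
     &\ge f(x^*) + \sigma \|x - x^*\|^2
      && \text{(optimality of } x^*\text{)}
\end{align*}
```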

Given a polytope
$\mathcal{P} = \{x \in \mathbb{R}^n : A_1 x = b_1,\ A_2 x \le b_2\}$, where
$A_2$ is $m \times n$, let $\mathcal{V}$ denote the set of vertices of
$\mathcal{P}$ and let $N = |\mathcal{V}|$. We assume that $\mathcal{P}$ is
bounded and we denote $D(\mathcal{P}) = \max_{x,y \in \mathcal{P}} \|x - y\|$.
We denote
$\xi(\mathcal{P}) = \min_{v \in \mathcal{V}} \min\{ b_2(j) - A_2(j)v : j \in [m],\ A_2(j)v < b_2(j) \}$.
Let $r(A_2)$ denote the row rank of the matrix $A_2$. Let
$\mathbb{A}(\mathcal{P})$ denote the set of all $r(A_2) \times n$ matrices
whose rows are linearly independent vectors chosen from the rows of $A_2$, and
denote $\psi(\mathcal{P}) = \max_{M \in \mathbb{A}(\mathcal{P})} \|M\|$.
Finally denote
\[ \mu(\mathcal{P}) = \frac{\psi(\mathcal{P}) D(\mathcal{P})}{\xi(\mathcal{P})} . \]

Henceforth we shall use the shorthand notation $D, \xi, \psi, \mu$ when the
polytope at hand is clear from the context.

Throughout this work we will assume that we have access to an oracle for
minimizing a linear objective over $\mathcal{P}$. That is, we are given a
procedure $\mathcal{O}_{\mathcal{P}} : \mathbb{R}^n \to \mathcal{V}$ such that
for all $c \in \mathbb{R}^n$,
$\mathcal{O}_{\mathcal{P}}(c) \in \arg\min_{v \in \mathcal{V}} c^\top v$.



2.1 The Conditional Gradient Method and Local Linear Oracles

The conditional gradient method is a simple algorithm for minimizing a
smooth convex function f over a convex set $\mathcal{P}$, which in this work
we assume to be a polytope. The appeal of the method is that it is a first
order interior point method: the iterates always lie inside the convex set,
so no projections are needed, and the update step on each iteration simply
requires minimizing a linear objective over the set. The basic algorithm is
given below.



Algorithm 1 Conditional Gradient
1: Let $x_1$ be an arbitrary point in $\mathcal{P}$.
2: for t = 1... do
3:   $p_t \leftarrow \mathcal{O}_{\mathcal{P}}(\nabla f(x_t))$.
4:   $x_{t+1} \leftarrow x_t + \alpha_t (p_t - x_t)$ for $\alpha_t \in (0, 1)$.
5: end for
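Algorithm 1 can be sketched in a few lines of Python. This is a minimal
illustration rather than a reference implementation: the polytope is taken to
be the probability simplex, the step size $\alpha_t = 2/(t+2)$ is one common
choice that decreases like 1/t, and the names `simplex_oracle`,
`conditional_gradient` and the test objective are all hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def simplex_oracle(c: Vector) -> Vector:
    """Linear oracle for the probability simplex: a vertex e_i minimizing c^T v."""
    i = min(range(len(c)), key=lambda j: c[j])
    return [1.0 if j == i else 0.0 for j in range(len(c))]


def conditional_gradient(
    grad: Callable[[Vector], Vector],
    oracle: Callable[[Vector], Vector],
    x1: Vector,
    num_iters: int,
) -> Vector:
    """Run Algorithm 1 with the assumed step size alpha_t = 2 / (t + 2)."""
    x = list(x1)
    for t in range(1, num_iters + 1):
        p = oracle(grad(x))            # p_t <- O_P(grad f(x_t))
        alpha = 2.0 / (t + 2)          # step size decreasing like 1/t
        x = [xi + alpha * (pi - xi) for xi, pi in zip(x, p)]
    return x


# Hypothetical test problem: f(x) = ||x - b||^2 with b inside the simplex.
b = [0.2, 0.3, 0.5]


def f(x: Vector) -> float:
    return sum((xi - bi) ** 2 for xi, bi in zip(x, b))


def grad_f(x: Vector) -> Vector:
    return [2.0 * (xi - bi) for xi, bi in zip(x, b)]


x_final = conditional_gradient(grad_f, simplex_oracle, [1.0, 0.0, 0.0], 500)
```

Every iterate is a convex combination of vertices returned by the oracle, so
no projection is ever computed.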



Let $x^* = \arg\min_{x \in \mathcal{P}} f(x)$. The proof of convergence of
Algorithm 1 is due to the following simple observation.

\[
\begin{aligned}
f(x_{t+1}) - f(x^*)
 &= f(x_t + \alpha_t (p_t - x_t)) - f(x^*) \\
 &\le f(x_t) - f(x^*) + \alpha_t (p_t - x_t)^\top \nabla f(x_t) + \alpha_t^2 \beta \|p_t - x_t\|^2
   && \beta\text{-smoothness of } f \\
 &\le f(x_t) - f(x^*) + \alpha_t (x^* - x_t)^\top \nabla f(x_t) + \alpha_t^2 \beta \|p_t - x_t\|^2
   && \text{optimality of } p_t \\
 &\le f(x_t) - f(x^*) + \alpha_t (f(x^*) - f(x_t)) + \alpha_t^2 \beta \|p_t - x_t\|^2
   && \text{convexity of } f \\
 &\le (1 - \alpha_t)(f(x_t) - f(x^*)) + \alpha_t^2 \beta D^2
\end{aligned}
\tag{1}
\]

The relatively slow convergence of the conditional gradient algorithm is due
to the term $\|p_t - x_t\|$ in the above analysis, which may remain as large
as the diameter of $\mathcal{P}$ while the term $f(x_t) - f(x^*)$ keeps
shrinking. This forces us to choose values of $\alpha_t$ that decrease like
1/t [U El ES].

Notice that if f is $\sigma$-strongly convex for some $\sigma > 0$, then
knowing that for some iteration t it holds that
$f(x_t) - f(x^*) \le \epsilon$ implies that
$\|x_t - x^*\|^2 \le \epsilon / \sigma$. Thus when choosing $p_t$, denoting
$r = \sqrt{\epsilon / \sigma}$, it is enough to consider points that lie in
the intersection set $\mathcal{P} \cap \mathbb{B}_r(x_t)$. In this case the
term $\|p_t - x_t\|^2$ will be of the same magnitude as $f(x_t) - f(x^*)$ (or
even smaller) and, as can be seen in (1), linear convergence may be
attainable. However, solving the problem
$\min_{p \in \mathcal{P} \cap \mathbb{B}_r(x_t)} p^\top \nabla f(x_t)$ is much
more difficult than solving the original linear problem
$\min_{p \in \mathcal{P}} p^\top \nabla f(x_t)$, and it cannot be solved in a
straightforward way using linear optimization over the original set alone.

To overcome the problem of solving the linear problem in the intersection
$\mathcal{P} \cap \mathbb{B}_r(x_t)$ we introduce the following definition,
which is a primary ingredient of our work.

Definition 4 (Local Linear Oracle). We say that a procedure
$\mathcal{A}(x, r, c)$, $x \in \mathcal{P}$, $r \in \mathbb{R}_+$,
$c \in \mathbb{R}^n$, is a Local Linear Oracle for the polytope $\mathcal{P}$
with parameter $\rho$ if $\mathcal{A}(x, r, c)$ returns a point
$p \in \mathcal{P}$ such that:

1. $\forall y \in \mathbb{B}_r(x) \cap \mathcal{P}$ it holds that $c^\top y \ge c^\top p$.

2. $\|x - p\| \le \rho \cdot r$.

The local linear oracle (LLO) relaxes the problem
$\min_{p \in \mathcal{P} \cap \mathbb{B}_r(x_t)} p^\top \nabla f(x_t)$ by
solving the linear problem on a larger set, but one that still has a diameter
that is not much larger than $f(x_t) - f(x^*)$. Our main contribution is
showing that for a polytope $\mathcal{P}$ a local linear oracle can be
constructed such that the parameter $\rho$ depends only on the dimension n and
the quantity $\mu(\mathcal{P})$. Moreover, the construction requires only a
single call to the oracle $\mathcal{O}_{\mathcal{P}}$.

2.2 Online and Stochastic Convex Optimization 

In the problem of online convex optimization (OCO) [201 HOI Ej, a decision
maker is iteratively required to choose a point $x_t \in \mathcal{K}$, where
$\mathcal{K}$ is a convex set. After choosing the point $x_t$, a convex loss
function $f_t(x)$ is revealed and the decision maker incurs loss $f_t(x_t)$.
The emphasis in this model is that the loss function at time t may be chosen
completely arbitrarily, and even in an adversarial manner, given the current
and past decisions of the decision maker. The goal of the decision maker is to
minimize the overall loss, and performance is measured in terms of regret: the
difference between the overall loss of the decision maker and that of the best
fixed point in $\mathcal{K}$ in hindsight. Formally, the regret after T rounds
is given by

\[ \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) . \]

The main challenge in this setting is to design algorithms with regret bounds
that are sublinear in the number of rounds T. For general convex losses the
optimal regret bound attainable scales like $\sqrt{T}$ [3]. In the case that
all loss functions are strongly convex the optimal regret bound attainable
scales like $\log(T)$ HI].

A simple algorithm that attains regret of $O(\sqrt{T})$ for general convex
losses is known as the Regularized Follow The Leader algorithm (RTFL) [5]. At
time t the algorithm predicts according to the following rule.

\[ x_t \leftarrow \arg\min_{x \in \mathcal{K}} \Big\{ \eta \sum_{\tau=1}^{t-1} \nabla f_\tau(x_\tau)^\top x + \mathcal{R}(x) \Big\} \tag{2} \]

Here $\eta$ is a parameter known as the learning rate and $\mathcal{R}$ is a
strongly convex function known as the regularization. From an offline
optimization point of view, achieving low regret is thus equivalent to solving
a strongly convex quadratic minimization problem on every iteration. We will
show how to attain low regret using only a linear optimization oracle, using
the ideas from subsection 2.1.
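To make the update rule (2) concrete, here is a small sketch in Python. It
assumes a one-dimensional decision set K = [-1, 1], the regularizer
$\mathcal{R}(x) = (x - x_1)^2$, and linear losses; under these assumptions the
minimizer in (2) has a closed form, namely the unconstrained minimizer clipped
to K. The function names and the loss sequence are hypothetical.

```python
from typing import List


def clip(value: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, value))


def rftl_interval(gradients: List[float], eta: float, x1: float = 0.0) -> List[float]:
    """Play the regularized-leader rule on K = [-1, 1] with R(x) = (x - x1)^2.

    The point played at time t minimizes eta * (g_1 + ... + g_{t-1}) * x + (x - x1)^2
    over K, which equals clip(x1 - eta * (g_1 + ... + g_{t-1}) / 2, -1, 1).
    """
    plays = []
    grad_sum = 0.0
    for g in gradients:
        plays.append(clip(x1 - eta * grad_sum / 2.0, -1.0, 1.0))
        grad_sum += g  # the loss at time t is revealed only after playing x_t
    return plays


T = 100
gradients = [1.0] * T                       # hypothetical linear losses f_t(x) = x
plays = rftl_interval(gradients, eta=1.0 / T ** 0.5)
total_loss = sum(g * x for g, x in zip(gradients, plays))
# For a linear cumulative loss the best fixed point is an endpoint of K.
best_fixed = min(sum(gradients) * x for x in (-1.0, 1.0))
regret = total_loss - best_fixed
```

In higher dimensions and over general polytopes the minimization in (2) is no
longer a simple clipping step, which is the cost the linear-oracle approach
aims to avoid.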



2.2.1 Stochastic Optimization 

In stochastic optimization the goal is to minimize a convex function F(x)
over the convex set $\mathcal{P}$, where we assume that there exists a
distribution $\mathcal{D}$ over a set of functions such that
$F = \mathbb{E}_{f \sim \mathcal{D}}[f]$. In this setting we don't have direct
access to the function F; instead we assume access to a random oracle for F
that we can query, which returns a function f sampled according to the
distribution $\mathcal{D}$ independently of previous samples. Thus if the
oracle returns a function f it holds that $\mathbb{E}[f(x)] = F(x)$ for all
$x \in \mathcal{P}$.

The general setting of online convex optimization is strictly harder than that
of stochastic optimization, in the sense that stochastic optimization can be
simulated in the setting of online convex optimization: on each iteration t
the loss function revealed, $f_t$, is produced by a query to the oracle of F.
In this case denote by $\mathcal{R}_T$ the regret of an algorithm for online
convex optimization after T iterations. That is,

\[ \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{P}} \sum_{t=1}^{T} f_t(x) = \mathcal{R}_T . \]

Denoting $x^* = \arg\min_{x \in \mathcal{P}} F(x)$ we thus in particular have,

\[ \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x^*) \le \mathcal{R}_T . \]

Taking expectation over the randomness of the oracle for F and dividing by
T we have,

\[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[F(x_t)] - F(x^*) \le \frac{\mathbb{E}[\mathcal{R}_T]}{T} . \]

Thus the same regret rates that are attainable for online convex optimization
hold as convergence rates, or sample complexity, for stochastic optimization.

3 Our Results 

In all of our results we assume that we perform optimization (either offline
or online) over a polytope $\mathcal{P}$ and that we have access to an oracle
$\mathcal{O}_{\mathcal{P}}$ that, given a linear objective
$c \in \mathbb{R}^n$, returns a vertex $v \in \mathcal{V}$ of $\mathcal{P}$
that minimizes c over $\mathcal{P}$.

Offline Optimization Given a $\beta$-smooth, $\sigma$-strongly convex
function f(x) we present an iterative algorithm that after t iterations
returns a point $x_{t+1} \in \mathcal{P}$ such that

\[ f(x_{t+1}) - f(x^*) \le C \exp\Big( -\frac{\sigma}{4 \beta n \mu^2} t \Big) \]

where $x^* = \arg\min_{x \in \mathcal{P}} f(x)$ and $C = f(x_1) - f(x^*)$. The
algorithm makes a total of t calls to the linear oracle of $\mathcal{P}$.

Online and Stochastic Optimization We present an algorithm for online
convex optimization such that

1. For arbitrary convex loss functions the regret after T rounds is

\[ \sum_{t=1}^{T} f_t(x_t) - \min_{x^* \in \mathcal{P}} \sum_{t=1}^{T} f_t(x^*) = O\big( G D \mu \sqrt{nT} \big) \]

2. If all loss functions are H-strongly convex then the regret after T rounds
is

\[ \sum_{t=1}^{T} f_t(x_t) - \min_{x^* \in \mathcal{P}} \sum_{t=1}^{T} f_t(x^*) = O\Big( \frac{(G + HD)^2 n \mu^2}{H} \log(T) \Big) \]

The algorithm performs a single call to the linear oracle of $\mathcal{P}$ per
iteration.

As discussed in subsection 2.2.1, online regret bounds can be directly
translated to convergence rates for stochastic optimization.

4 A New Linearly Convergent Algorithm for 
Offline Optimization 

In this section we present and analyse an algorithm for the following offline
optimization problem

\[ \min_{x \in \mathcal{P}} f(x) \tag{3} \]

where we assume that f is $\beta$-smooth and $\sigma$-strongly convex and
$\mathcal{P}$ is a polytope. We assume that we have a LLO oracle for
$\mathcal{P}$, $\mathcal{A}(x, r, c)$. In section 5 we show that given an
oracle for linear minimization over $\mathcal{P}$, such a LLO oracle can be
constructed.
Our algorithm for (3) is given below.

Algorithm 2
1: Input: $\mathcal{A}(x, r, c)$ - LLO with parameter $\rho$.
2: Let $x_1$ be an arbitrary point in $\mathcal{V}$ and let $C \ge f(x_1) - f(x^*)$.
3: Let $\alpha = \frac{\sigma}{2 \beta \rho^2}$.
4: for t = 1... do
5:   $r_t \leftarrow \min\{ \sqrt{C/\sigma}\, e^{-\frac{\sigma}{8 \beta \rho^2}(t-1)},\ D \}$.
6:   $p_t \leftarrow \mathcal{A}(x_t, r_t, \nabla f(x_t))$.
7:   $x_{t+1} \leftarrow x_t + \alpha (p_t - x_t)$.
8: end for
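The loop of Algorithm 2 can be illustrated on a toy instance. The sketch
below is written under explicit assumptions: the step size is taken as
$\alpha = \sigma / (2 \beta \rho^2)$ and the radii as
$r_t = \min\{\sqrt{C/\sigma}\, e^{-\alpha (t-1)/4}, D\}$ (values consistent
with the analysis in this section), the polytope is the interval [0, 1], and
`interval_llo` is a hypothetical local linear oracle for that interval with
$\rho = 1$. It is not the construction of section 5.

```python
import math
from typing import Callable, List


def interval_llo(x: float, r: float, c: float) -> float:
    """Toy LLO for P = [0, 1] with rho = 1 (an assumption of this sketch).

    Returns the minimizer of c * y over [x - r, x + r] intersected with [0, 1].
    """
    if c > 0:
        return max(x - r, 0.0)
    if c < 0:
        return min(x + r, 1.0)
    return x


def llo_conditional_gradient(
    grad: Callable[[float], float],
    llo: Callable[[float, float, float], float],
    x1: float,
    C: float,
    sigma: float,
    beta: float,
    rho: float,
    D: float,
    num_iters: int,
) -> List[float]:
    """Return the iterates x_1, ..., x_{num_iters + 1} of the assumed Algorithm 2."""
    alpha = sigma / (2.0 * beta * rho ** 2)          # assumed fixed step size
    xs = [x1]
    x = x1
    for t in range(1, num_iters + 1):
        # assumed geometrically shrinking search radius
        r = min(math.sqrt(C / sigma) * math.exp(-alpha * (t - 1) / 4.0), D)
        p = llo(x, r, grad(x))
        x = x + alpha * (p - x)
        xs.append(x)
    return xs


# Hypothetical instance: f(x) = (x - 0.3)^2 on [0, 1]; with the definitions of
# this paper (no factor 1/2) it is 1-smooth and 1-strongly convex.
x_star = 0.3


def f(x: float) -> float:
    return (x - x_star) ** 2


def grad_f(x: float) -> float:
    return 2.0 * (x - x_star)


C = f(1.0) - f(x_star)
iterates = llo_conditional_gradient(grad_f, interval_llo, 1.0, C, 1.0, 1.0, 1.0, 1.0, 40)
```

On this instance the error of the iterate $x_{t+1}$ stays below
$C e^{-\alpha t / 2}$, which is the kind of geometric decay the analysis
predicts.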



Theorem 1. After $t \ge 1$ iterations Algorithm 2 has made t calls to the
linear oracle $\mathcal{O}_{\mathcal{P}}$ and the point
$x_{t+1} \in \mathcal{P}$ satisfies

\[ f(x_{t+1}) - f(x^*) \le C \exp\Big( -\frac{\sigma}{4 \beta n \mu^2} t \Big) \]

where $x^* = \arg\min_{x \in \mathcal{P}} f(x)$ and $C = f(x_1) - f(x^*)$.

We now turn to analyse the convergence rate of Algorithm 2. The following
lemma is of general interest and will also be used in the section on online
optimization.

Lemma 1. Assume that f(x) is $\beta$-smooth. Let
$x^* = \arg\min_{x \in \mathcal{P}} f(x)$ and assume that on iteration t it
holds that $\|x_t - x^*\| \le r_t$. Then for every $\alpha \in (0, 1)$ it
holds that,

\[ f(x_{t+1}) - f(x^*) \le (1 - \alpha)(f(x_t) - f(x^*)) + \beta \alpha^2 \min\{\rho^2 r_t^2, D^2\} \]

Proof. By the $\beta$-smoothness of f(x) and the update step of Algorithm 2
we have,

\[ f(x_{t+1}) = f(x_t + \alpha (p_t - x_t)) \le f(x_t) + \alpha \nabla f(x_t)^\top (p_t - x_t) + \beta \alpha^2 \|p_t - x_t\|^2 \]

Since $\|x_t - x^*\| \le r_t$, by the definition of the oracle $\mathcal{A}$
it holds that i) $p_t^\top \nabla f(x_t) \le x^{*\top} \nabla f(x_t)$ and ii)
$\|x_t - p_t\| \le \min\{\rho r_t, D\}$. Thus we have that,

\[ f(x_{t+1}) \le f(x_t) + \alpha \nabla f(x_t)^\top (x^* - x_t) + \beta \alpha^2 \min\{\rho^2 r_t^2, D^2\} \]

Using the convexity of f(x) and subtracting $f(x^*)$ from both sides we have,

\[ f(x_{t+1}) - f(x^*) \le (1 - \alpha)(f(x_t) - f(x^*)) + \beta \alpha^2 \min\{\rho^2 r_t^2, D^2\} \]

□

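Lemma 2. Let $\alpha$ be as in Algorithm 2. Denote
$h_t = f(x_t) - f(x^*)$ and let $C \ge h_1$. Then

\[ h_t \le C e^{-\frac{\sigma}{4 \beta \rho^2}(t-1)} \quad \forall t \ge 1 \]

Proof. The proof is by a simple induction. For t = 1 we have that
$h_1 = f(x_1) - f(x^*) \le C$.

Now assume that the claim holds for $t \ge 1$. This implies via the strong
convexity of f(x) that

\[ \|x_t - x^*\|^2 \le \frac{1}{\sigma}(f(x_t) - f(x^*)) = \frac{1}{\sigma} h_t \le \frac{C}{\sigma} e^{-\frac{\sigma}{4 \beta \rho^2}(t-1)} \]

Setting $r_t$ such that
$r_t^2 = \frac{C}{\sigma} e^{-\frac{\sigma}{4 \beta \rho^2}(t-1)}$, we have
that $x^* \in \mathcal{P} \cap \mathbb{B}_{r_t}(x_t)$. Applying Lemma 1 with
respect to $x_t$ and by induction we get,

\[ h_{t+1} \le (1 - \alpha)\, C e^{-\frac{\sigma}{4 \beta \rho^2}(t-1)} + \beta \alpha^2 \rho^2 \frac{C}{\sigma} e^{-\frac{\sigma}{4 \beta \rho^2}(t-1)} = C e^{-\frac{\sigma}{4 \beta \rho^2}(t-1)} \Big( 1 - \alpha + \frac{\alpha^2 \beta \rho^2}{\sigma} \Big) \]

By plugging the value of $\alpha$ and using $(1 - x) \le e^{-x}$ we have

\[ h_{t+1} \le C e^{-\frac{\sigma}{4 \beta \rho^2}(t-1)} \Big( 1 - \frac{\sigma}{4 \beta \rho^2} \Big) \le C e^{-\frac{\sigma}{4 \beta \rho^2} t} \]

□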

We can now prove Theorem 1.

Proof. In Lemma 5 we prove that Algorithm 3 is a LLO with parameter
$\rho = \sqrt{n} \mu$, and in Lemma 6 we show that this construction requires
a single call to the oracle $\mathcal{O}_{\mathcal{P}}$. The convergence
result thus follows from Lemma 2. □

5 Constructing a Local Linear Oracle 

In this section we show how to construct an algorithm for the procedure
$\mathcal{A}(x, r, c)$ given only an oracle that minimizes a linear objective
over the polytope $\mathcal{P}$.

The algorithm for the local linear oracle is given below. Note that the
algorithm assumes that the input point x is given in the form of a convex
combination of vertices of the polytope. Later on we show that maintaining
such a decomposition of the input point x is easy.



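Algorithm 3
1: Input: a point $x \in \mathcal{P}$ such that $x = \sum_{i=1}^{k} \lambda_i v_i$,
   $\lambda_i > 0$, $\sum_{i=1}^{k} \lambda_i = 1$, $v_i \in \mathcal{V}$,
   radius $r > 0$, linear objective $c \in \mathbb{R}^n$.
2: $\Delta \leftarrow \min\{ \frac{\sqrt{n} \psi}{\xi} r,\ 1 \}$.
3: $\forall i \in [k]$: $\ell_i \leftarrow c^\top v_i$.
4: Let $i_1, \ldots, i_k$ be a permutation over [k] such that $\ell_{i_1} \ge \ell_{i_2} \ge \cdots \ge \ell_{i_k}$.
5: for j = 1...k do
6:   $\lambda'_{i_j} \leftarrow \max\{0, \lambda_{i_j} - \Delta\}$.
7:   $\Delta \leftarrow \Delta - (\lambda_{i_j} - \lambda'_{i_j})$.
8: end for
9: $v \leftarrow \mathcal{O}_{\mathcal{P}}(c)$.
10: return $p \leftarrow \sum_{i=1}^{k} \lambda'_i v_i + \big( 1 - \sum_{i=1}^{k} \lambda'_i \big) v$.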
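The weight-shifting idea of Algorithm 3 can be sketched in Python as follows.
This is an illustrative sketch under stated assumptions: the caller passes the
constant `scale`, standing for $\sqrt{n} \psi / \xi$, and the example uses the
probability simplex, for which one can check that $\psi = \xi = 1$ and
$D = \sqrt{2}$. The function and variable names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def local_linear_oracle(
    vertices: Sequence[Vector],
    weights: Sequence[float],
    r: float,
    c: Vector,
    oracle: Callable[[Vector], Vector],
    scale: float,
) -> Vector:
    """Sketch of Algorithm 3; `scale` stands for sqrt(n) * psi / xi."""
    delta = min(scale * r, 1.0)
    # l_i = c^T v_i for every vertex in the decomposition of x
    scores = [sum(ci * vi for ci, vi in zip(c, v)) for v in vertices]
    order = sorted(range(len(vertices)), key=lambda i: scores[i], reverse=True)
    new_weights = list(weights)
    for i in order:                          # remove mass from the worst vertices first
        new_weights[i] = max(0.0, weights[i] - delta)
        delta -= weights[i] - new_weights[i]
    v = oracle(c)                            # the single call to the linear oracle
    leftover = 1.0 - sum(new_weights)        # mass moved onto the minimizing vertex v
    return [
        sum(w * u[j] for w, u in zip(new_weights, vertices)) + leftover * v[j]
        for j in range(len(c))
    ]


def simplex_oracle(c: Vector) -> Vector:
    """Linear oracle for the probability simplex."""
    i = min(range(len(c)), key=lambda j: c[j])
    return [1.0 if j == i else 0.0 for j in range(len(c))]


n = 3
vertices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = [1.0 / 3] * 3
x = [1.0 / 3] * 3                            # the point decomposed by `weights`
c = [1.0, 2.0, 3.0]
r = 0.1
p = local_linear_oracle(vertices, weights, r, c, simplex_oracle, math.sqrt(n))
```

Only the total mass $\Delta$ is moved, so the returned point stays within
distance $\Delta D$ of x while improving the linear objective as much as that
budget allows.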



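Lemma 3. Let $x \in \mathcal{P}$ be the input to Algorithm 3 and let
$y \in \mathcal{P}$. Write
$y = \sum_{i=1}^{k} (\lambda_i - \Delta_i) v_i + \big( \sum_{i=1}^{k} \Delta_i \big) z$
for some $\Delta_i \in [0, \lambda_i]$ and $z \in \mathcal{P}$ such that the
sum $\sum_{i=1}^{k} \Delta_i$ is minimized. Then for all $i \in [k]$ for which
$\Delta_i > 0$ there exists an index $j \in [m]$ such that
$A_2(j) v_i < b_2(j)$ and $A_2(j) z = b_2(j)$.

Proof. Denote $\Delta = \sum_{i=1}^{k} \Delta_i$. Assume the lemma is false
and let $i' \in [k]$ be such that $\Delta_{i'} > 0$ and for all $j \in [m]$ it
holds that if $A_2(j) v_{i'} < b_2(j)$ then $A_2(j) z < b_2(j)$. Given
$j \in [m]$ we consider two cases. If $A_2(j) v_{i'} = b_2(j)$ then for any
$\gamma \in (0, 1]$ it holds that
$A_2(j)(z - \gamma v_{i'}) \le b_2(j) - \gamma b_2(j) = (1 - \gamma) b_2(j)$.

If $A_2(j) v_{i'} < b_2(j)$ then by the assumption $A_2(j) z < b_2(j)$ and
there exist scalars $\epsilon_j, \delta_j > 0$ such that
$A_2(j) v_{i'} = b_2(j) - \delta_j$ and $A_2(j) z = b_2(j) - \epsilon_j$. Now
given a scalar $\gamma > 0$ it holds that
$A_2(j)(z - \gamma v_{i'}) = b_2(j) - \epsilon_j - \gamma (b_2(j) - \delta_j) = (1 - \gamma) b_2(j) - (\epsilon_j - \gamma \delta_j)$.
Choosing $\gamma \le \epsilon_j / \delta_j$ we get that
$A_2(j)(z - \gamma v_{i'}) \le (1 - \gamma) b_2(j)$.

Combining the two cases above we conclude that for all
$\gamma \le \min\{1, \min_j \epsilon_j / \delta_j\}$, where the inner minimum
is over the indices j of the second case, it holds that
$A_2 (z - \gamma v_{i'}) \le (1 - \gamma) b_2$. Since it also holds that
$A_1 (z - \gamma v_{i'}) = (1 - \gamma) b_1$ we have that
$z - \gamma v_{i'} \in (1 - \gamma) \mathcal{P}$.

Thus in particular by choosing such a $\gamma$ that also satisfies
$\gamma \le \Delta_{i'} / \Delta$, there exists $w \in \mathcal{P}$ such that
$z = (1 - \gamma) w + \gamma v_{i'}$ and,

\[
\begin{aligned}
y &= \sum_{i=1}^{k} (\lambda_i - \Delta_i) v_i + \Delta \big( (1 - \gamma) w + \gamma v_{i'} \big) \\
  &= \sum_{i \ne i'} (\lambda_i - \Delta_i) v_i + \big( \lambda_{i'} - (\Delta_{i'} - \gamma \Delta) \big) v_{i'} + \Delta (1 - \gamma) w
\end{aligned}
\]

Thus by defining $\tilde{\Delta}_i = \Delta_i$ for all $i \in [k]$,
$i \ne i'$, and $\tilde{\Delta}_{i'} = \Delta_{i'} - \gamma \Delta$, we have
that
$y = \sum_{i=1}^{k} (\lambda_i - \tilde{\Delta}_i) v_i + \big( \sum_{i=1}^{k} \tilde{\Delta}_i \big) w$
with $\sum_{i=1}^{k} \tilde{\Delta}_i = (1 - \gamma) \sum_{i=1}^{k} \Delta_i$,
which contradicts the minimality of $\sum_{i=1}^{k} \Delta_i$. □

Claim 1. Let $z \in \mathcal{P}$ and denote
$C(z) = \{ i \in [m] : A_2(i) z = b_2(i) \}$, and let $C_0(z) \subseteq C(z)$
be such that the set $\{A_2(i)\}_{i \in C_0(z)}$ is a basis for the set
$\{A_2(i)\}_{i \in C(z)}$. Then given $y \in \mathcal{P}$, if there exists
$i \in C(z)$ such that $A_2(i) y < b_2(i)$ then there exists $i_0 \in C_0(z)$
such that $A_2(i_0) y < b_2(i_0)$.

Proof. Assume by way of contradiction that there exist $y \in \mathcal{P}$
and $i \in C(z)$ such that $A_2(i) y < b_2(i)$ and for any $j \in C_0(z)$ it
holds that $A_2(j) y = b_2(j)$. Since $A_2(i)$ is a linear combination of
vectors from $\{A_2(j)\}_{j \in C_0(z)}$ it follows that the linear system
$A_2(j) x = b_2(j)$, $j \in C_0(z) \cup \{i\}$, has no solution, which
contradicts the fact that $A_2(i) z = b_2(i)$ for all $i \in C(z)$. □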
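Lemma 4. Let $x \in \mathcal{P}$ be the input to Algorithm 3 and let
$y \in \mathcal{P}$ be such that $\|x - y\| \le r$. Write
$y = \sum_{i=1}^{k} (\lambda_i - \Delta_i) v_i + \big( \sum_{i=1}^{k} \Delta_i \big) z$
for some $\Delta_i \in [0, \lambda_i]$ and $z \in \mathcal{P}$ such that the
sum $\sum_{i=1}^{k} \Delta_i$ is minimized. Then

\[ \sum_{i=1}^{k} \Delta_i \le \frac{\sqrt{n} \psi}{\xi} r . \]

Proof. Denote $C(z) = \{ j \in [m] : A_2(j) z = b_2(j) \}$ and let
$C_0(z) \subseteq C(z)$ be such that the set of vectors
$\{A_2(i)\}_{i \in C_0(z)}$ is a basis for the set $\{A_2(i)\}_{i \in C(z)}$.
Denote by $A_{2,z} \in \mathbb{R}^{|C_0(z)| \times n}$ the matrix $A_2$ after
deleting every row $i \notin C_0(z)$ and recall that by definition
$\|A_{2,z}\| \le \psi$. Then it holds that,

\[
\begin{aligned}
\|x - y\|^2 &= \Big\| \sum_{i=1}^{k} \Delta_i (v_i - z) \Big\|^2
  \ge \frac{1}{\psi^2} \Big\| A_{2,z} \Big( \sum_{i=1}^{k} \Delta_i (v_i - z) \Big) \Big\|^2 \\
 &= \frac{1}{\psi^2} \sum_{j \in C_0(z)} \Big( \sum_{i=1}^{k} \Delta_i \big( A_2(j) v_i - b_2(j) \big) \Big)^2
\end{aligned}
\]

Note that $|C_0(z)| \le n$ and that for any vector
$x \in \mathbb{R}^{|C_0(z)|}$ it holds that
$\|x\| \ge \frac{1}{\sqrt{|C_0(z)|}} \|x\|_1$. Thus we have that,

\[
\|x - y\|^2 \ge \frac{1}{n \psi^2} \Big( \sum_{j \in C_0(z)} \Big| \sum_{i=1}^{k} \Delta_i \big( A_2(j) v_i - b_2(j) \big) \Big| \Big)^2
 = \frac{1}{n \psi^2} \Big( \sum_{j \in C_0(z)} \sum_{i=1}^{k} \Delta_i \big( b_2(j) - A_2(j) v_i \big) \Big)^2
\]

Combining Lemma 3 and Claim 1 we have that for all $i \in [k]$ such that
$\Delta_i > 0$ there exists $j \in C_0(z)$ such that
$A_2(j) v_i \le b_2(j) - \xi$. Hence,

\[ \|x - y\|^2 \ge \frac{1}{n \psi^2} \Big( \sum_{i=1}^{k} \Delta_i \xi \Big)^2 = \frac{\xi^2}{n \psi^2} \Big( \sum_{i=1}^{k} \Delta_i \Big)^2 \]

Since $\|x - y\|^2 \le r^2$ we conclude that
$\sum_{i=1}^{k} \Delta_i \le \frac{\sqrt{n} \psi}{\xi} r$. □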

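The following lemma establishes that Algorithm 3 is a local linear oracle
for $\mathcal{P}$ with parameter $\rho = \sqrt{n} \mu$.

Lemma 5. Assume that the input to Algorithm 3 is
$x = \sum_{i=1}^{k} \lambda_i v_i$ such that $\forall i \in [k]$,
$\lambda_i > 0$, $v_i \in \mathcal{V}$ and $\sum_{i=1}^{k} \lambda_i = 1$. Let
p be the point returned by Algorithm 3. Then the following conditions hold:

1. $p \in \mathcal{P}$.

2. $\|x - p\| \le \sqrt{n} \mu r$.

3. $\forall y \in \mathbb{B}_r(x) \cap \mathcal{P}$ it holds that $c^\top y \ge c^\top p$.

Proof. Condition 1 holds since p is given as a convex combination of points
in $\mathcal{V}$.

For condition 2 note that,

\[
\begin{aligned}
\|x - p\| &= \Big\| \sum_{i=1}^{k} (\lambda_i - \lambda'_i) v_i - \Big( 1 - \sum_{i=1}^{k} \lambda'_i \Big) v \Big\| \\
 &= \Big\| \sum_{i=1}^{k} (\lambda_i - \lambda'_i) v_i - \sum_{i=1}^{k} (\lambda_i - \lambda'_i) v \Big\| \\
 &= \Big\| \sum_{i=1}^{k} (\lambda_i - \lambda'_i)(v_i - v) \Big\| \le \sum_{i=1}^{k} (\lambda_i - \lambda'_i) \|v_i - v\|
\end{aligned}
\]

where the last inequality is due to the triangle inequality.

According to Algorithm 3 it holds that
$\sum_{i=1}^{k} (\lambda_i - \lambda'_i) \le \frac{\sqrt{n} \psi}{\xi} r$ and
thus

\[ \|x - p\| \le \frac{\sqrt{n} \psi D}{\xi} r = \sqrt{n} \mu r . \]

Finally, for condition 3, let $y \in \mathbb{B}_r(x) \cap \mathcal{P}$. From
Lemma 4 we can write
$y = \sum_{i=1}^{k} (\lambda_i - \Delta_i) v_i + \big( \sum_{i=1}^{k} \Delta_i \big) z$
such that $\Delta_i \in [0, \lambda_i]$, $z \in \mathcal{P}$ and
$\sum_{i=1}^{k} \Delta_i = \min\{ \frac{\sqrt{n} \psi}{\xi} r, 1 \}$. Thus,

\[
\begin{aligned}
c^\top y &= \sum_{i=1}^{k} (\lambda_i - \Delta_i) c^\top v_i + \Big( \sum_{i=1}^{k} \Delta_i \Big) c^\top z \\
 &= \sum_{i=1}^{k} (\lambda_i - \Delta_i) \ell_i + \Big( \sum_{i=1}^{k} \Delta_i \Big) c^\top z \\
 &\ge \sum_{i=1}^{k} (\lambda_i - \Delta_i) \ell_i + \Big( \sum_{i=1}^{k} \Delta_i \Big) c^\top v \\
 &= \sum_{j=1}^{k} (\lambda_{i_j} - \Delta_{i_j}) \ell_{i_j} + \Big( \sum_{i=1}^{k} \Delta_i \Big) c^\top v
\end{aligned}
\]

Since Algorithm 3 reduces the weights of the vertices $v_i$ according to a
decreasing order of $\ell_i$ we have that
$\sum_{j=1}^{k} (\lambda_{i_j} - \Delta_{i_j}) \ell_{i_j} \ge \sum_{j=1}^{k} \lambda'_{i_j} \ell_{i_j}$.
Thus we conclude that
$c^\top y \ge \sum_{j=1}^{k} \lambda'_{i_j} \ell_{i_j} + \big( \sum_{i=1}^{k} \Delta_i \big) c^\top v = c^\top p$. □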

Note that Algorithm 3 assumes that the input point x is given by its
convex decomposition into vertices. All optimization algorithms in this work
use Algorithm 3 in the following way: they give as input to Algorithm 3 the
current iterate $x_t$, and then, given the output $p_t$ of Algorithm 3, they
produce the next iterate $x_{t+1}$ by taking a convex combination
$x_{t+1} \leftarrow (1 - \alpha) x_t + \alpha p_t$ for some parameter
$\alpha \in (0, 1)$. Thus if the convex decomposition of $x_t$ is given,
updating it to the convex decomposition of $x_{t+1}$ is straightforward.
Moreover, denoting by $\mathcal{V}_t \subseteq \mathcal{V}$ the set of
vertices that form the convex decomposition of $x_t$, it is clear from
Algorithm 3 that $|\mathcal{V}_{t+1} \setminus \mathcal{V}_t| \le 1$, since at
most a single vertex (v) is added to the decomposition.
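The bookkeeping described above can be sketched as follows. This is a minimal
illustration, assuming the decomposition is stored as a dictionary from
vertices (tuples) to weights; the function name and the example values are
hypothetical.

```python
from typing import Dict, Tuple

Vertex = Tuple[float, ...]


def update_decomposition(
    weights: Dict[Vertex, float],
    llo_weights: Dict[Vertex, float],
    new_vertex: Vertex,
    alpha: float,
) -> Dict[Vertex, float]:
    """Decomposition of x_{t+1} = (1 - alpha) * x_t + alpha * p_t.

    `weights` decomposes x_t; `llo_weights` holds the reduced weights that the
    local linear oracle keeps on the same vertices; the remaining mass of p_t
    sits on `new_vertex`, the vertex returned by the linear oracle.
    """
    updated = {
        v: (1.0 - alpha) * w + alpha * llo_weights.get(v, 0.0)
        for v, w in weights.items()
    }
    leftover = 1.0 - sum(llo_weights.values())
    updated[new_vertex] = updated.get(new_vertex, 0.0) + alpha * leftover
    # drop vertices whose weight vanished
    return {v: w for v, w in updated.items() if w > 0.0}


weights = {(1.0, 0.0, 0.0): 0.5, (0.0, 1.0, 0.0): 0.5}
llo_weights = {(1.0, 0.0, 0.0): 0.5, (0.0, 1.0, 0.0): 0.2}
new_weights = update_decomposition(weights, llo_weights, (0.0, 0.0, 1.0), alpha=0.5)
```

At most one new key is added per update, which matches the observation that
the decomposition grows by at most one vertex per iteration.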

Lemma 6. Algorithm 3 has an implementation such that each invocation of
the algorithm requires a single call to the oracle
$\mathcal{O}_{\mathcal{P}}$ and additional $O(T(n + \log T))$ time, where T is
the total number of calls to Algorithm 3.

Proof. Clearly Algorithm 3 calls $\mathcal{O}_{\mathcal{P}}$ only once. The
complexity of all other operations depends on k, the number of vertices in the
convex decomposition of the input point x. As we discussed, if we denote by
$x_t, x_{t+1}$ the inputs to the algorithm on calls number t and t + 1, and by
$k_t, k_{t+1}$ the number of vertices in the convex decomposition of
$x_t, x_{t+1}$ respectively, then $k_{t+1} \le k_t + 1$. Thus if the algorithm
is called a total number of T times and the initial point ($x_1$) is a vertex,
then at all times $k \le T$. Since all other operations except for calling
$\mathcal{O}_{\mathcal{P}}$ consist of computing k inner products between
vectors in $\mathbb{R}^n$ and sorting k scalars, the lemma follows. □

Note that we can get rid of the linear dependence on T in the bound in
Lemma 6 by decomposing the iterate $x_t$ into a convex sum of fewer vertices
in case the number of vertices in the current decomposition (k) becomes too
large. From Caratheodory's theorem we know that there exists a decomposition
with at most n + 1 vertices, and for many polytopes of interest there is an
efficient algorithm for computing such a decomposition. Another method for
computing such a small decomposition is by bootstrapping Algorithm 2 for
solving the optimization problem $\min_{x \in \mathcal{P}} \|x_t - x\|^2$.

6 Online and Stochastic Convex Optimization

In this section we present algorithms for the general setting of online convex 
optimization that are suitable when the decision set is a polytope. We present 
regret bounds for both general convex losses and for strongly convex losses. 

Our algorithm for online convex optimization is given below. The functions
$F_t(x)$, used by the algorithm (in line 6), will be specified precisely in
the analysis. Informally, $F_t(x)$ aggregates information on the loss
functions on all times 1..t plus some regularization term.



Algorithm 4
1: Input: horizon T, learning rate $\eta > 0$, set of radii $\{r_t\}_{t=1}^{T}$,
   optimization parameter $\alpha \in (0, 1)$, LLO $\mathcal{A}$ with parameter $\rho$.
2: Let $x_1$ be an arbitrary vertex in $\mathcal{V}$.
3: for t = 1...T do
4:   play $x_t$.
5:   receive $f_t$.
6:   $p_t \leftarrow \mathcal{A}(x_t, r_t, \nabla F_t(x_t))$.
7:   $x_{t+1} \leftarrow x_t + \alpha (p_t - x_t)$.
8: end for



We have the following two main results. 

Denote $G = \sup_{x \in \mathcal{P}, t \in [T]} \|\nabla f_t(x)\|$.

Theorem 2. For general convex losses the regret of Algorithm 4 is
$O(G D \rho \sqrt{T})$.

Theorem 3. If all loss functions $f_t(x)$ are H-strongly convex then the
regret of Algorithm 4 is $O\big( ((G + HD)^2 \rho^2 / H) \log T \big)$.

6.1 Analysis for general convex losses 

For time $t \in [T]$ we define the function
$F_t(x) = \eta \big( \sum_{\tau=1}^{t} \nabla f_\tau(x_\tau)^\top x \big) + \|x - x_1\|^2$,
where $\eta$ is a parameter that will be determined in the analysis.

Denote $x_1^* = x_1$ and for all $t \in [T - 1]$,
$x_{t+1}^* = \arg\min_{x \in \mathcal{P}} F_t(x)$. Denote also
$x^* = \arg\min_{x \in \mathcal{P}} \sum_{t=1}^{T} f_t(x)$. Observe that
$F_t(x)$ is 1-smooth and 1-strongly convex.

Lemma 7. There is a choice for the parameters $\eta, \alpha, r_t$ such that
for any $\epsilon > 0$ it holds that for all $t \in [T]$:
$\|x_t - x_t^*\| \le \sqrt{\epsilon}$.

Proof. We prove by induction that for all $t \in [T]$ it holds that
$F_{t-1}(x_t) - F_{t-1}(x_t^*) \le \epsilon$. By the strong convexity of
$F_{t-1}$ this yields that $\|x_t - x_t^*\| \le \sqrt{\epsilon}$.

The proof is by induction on t. For t = 1 it holds that $x_1 = x_1^*$ and thus
the claim holds. Thus assume that for time $t \ge 1$ it holds that
$F_{t-1}(x_t) - F_{t-1}(x_t^*) \le \epsilon$. By the strong convexity of
$F_{t-1}(x)$ and the assumption that the claim holds for time t we have that,

\[ \|x_t - x_t^*\| \le \sqrt{\epsilon} \tag{4} \]

By the definition of $F_t(x)$ and $x_t^*$ we have that
$F_t(x_t^*) - F_t(x_{t+1}^*) = F_{t-1}(x_t^*) - F_{t-1}(x_{t+1}^*) + \eta \nabla f_t(x_t)^\top (x_t^* - x_{t+1}^*) \le \eta G \|x_{t+1}^* - x_t^*\|$
and thus again by the strong convexity of $F_t(x)$ we have that

\[ \|x_{t+1}^* - x_t^*\| \le \eta G \tag{5} \]

Combining (4), (5) we have,

\[ \|x_t - x_{t+1}^*\| \le \sqrt{\epsilon} + \eta G \]

By induction,

\[
\begin{aligned}
F_t(x_t) - F_t(x_{t+1}^*) &= F_{t-1}(x_t) - F_{t-1}(x_{t+1}^*) + \eta \nabla f_t(x_t)^\top (x_t - x_{t+1}^*) \\
 &\le \epsilon + \eta G \|x_t - x_{t+1}^*\| \le \epsilon + \eta G \sqrt{\epsilon} + \eta^2 G^2
\end{aligned}
\tag{6}
\]

Setting $r_t = \sqrt{\epsilon} + \eta G$ we can apply Lemma 1 with respect to
$F_t(x)$ and get,

\[ F_t(x_{t+1}) - F_t(x_{t+1}^*) \le (1 - \alpha)(F_t(x_t) - F_t(x_{t+1}^*)) + \alpha^2 \rho^2 (\sqrt{\epsilon} + \eta G)^2 \]

Plugging (6),

\[ F_t(x_{t+1}) - F_t(x_{t+1}^*) \le (1 - \alpha)(\epsilon + \eta G \sqrt{\epsilon} + \eta^2 G^2) + 2 \alpha^2 \rho^2 (\epsilon + \eta G \sqrt{\epsilon} + \eta^2 G^2) \]

Setting $\alpha = \frac{1}{4 \rho^2}$ we get,

\[ F_t(x_{t+1}) - F_t(x_{t+1}^*) \le (\epsilon + \eta G \sqrt{\epsilon} + \eta^2 G^2) \Big( 1 - \frac{1}{8 \rho^2} \Big) \]
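Plugging in a sufficiently small value of $\eta$, of the order of
$\sqrt{\epsilon} / (G \rho^2)$, gives
$F_t(x_{t+1}) - F_t(x_{t+1}^*) \le \epsilon$, which completes the induction.

□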

We are now ready to prove Theorem 2.

Proof. Observe that playing the point
$x_{t+1}^* = \arg\min_{x \in \mathcal{P}} F_t(x)$ on each time t is equivalent
to playing the leader on each time with respect to the loss functions
$f'_1(x) = \nabla f_1(x_1)^\top x + \frac{1}{\eta} \|x - x_1\|^2$ and
$f'_t(x) = \nabla f_t(x_t)^\top x$ for every $t > 1$. This strategy of playing
on each time according to the leader is known to achieve overall zero regret,
see [H]. Thus,

\[ \sum_{t=1}^{T} \nabla f_t(x_t)^\top (x_{t+1}^* - x^*) \le \frac{1}{\eta} \big( \|x^* - x_1\|^2 - \|x_2^* - x_1\|^2 \big) \le \frac{D^2}{\eta} \]

By the definition of $F_t(x)$, $x_t^*$ and the use of strong convexity we have
that,

\[ \|x_t^* - x_{t+1}^*\|^2 \le F_t(x_t^*) - F_t(x_{t+1}^*) \le \eta G \|x_t^* - x_{t+1}^*\| \]

Since this gives $\|x_t^* - x_{t+1}^*\| \le \eta G$, combining with the
previous bound we get,

\[ \sum_{t=1}^{T} \nabla f_t(x_t)^\top (x_t^* - x^*) \le \frac{D^2}{\eta} + T \eta G^2 \]
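Plugging lemma 7, with $\eta$ set to the value determined there, and using the convexity of $f_t(\cdot)$, we get that for all $\epsilon > 0$,

\[ \sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \sum_{t=1}^T \nabla f_t(x_t)^\top (x_t - x^*) \le \frac{D^2}{\eta} + T \eta G^2 + T G \sqrt{\epsilon} \]

The theorem now follows from plugging in the values of $\eta$ and $\epsilon$. □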
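To make the "leader" sequence concrete, here is a minimal sketch for the special case where the polytope is the unit box $[0,1]^n$. This domain is an assumption made here only because the minimizer of $F_t(x) = \eta \sum_{\tau \le t} \nabla f_\tau(x_\tau)^\top x + \|x - x_1\|^2$ then has a closed form (coordinate-wise clipping). The sketch checks the stability bound $\|x_{t+1}^* - x_t^*\| \le \eta G$ on random linear losses; the function names and the random gradients are hypothetical.

```python
import math
import random

def leader(x1, grad_sum, eta):
    """argmin over the box [0,1]^n of eta*<grad_sum, x> + ||x - x1||^2.
    The objective is separable, so the minimizer is the unconstrained
    solution x1 - (eta/2)*grad_sum clipped coordinate-wise to [0, 1]."""
    return [min(1.0, max(0.0, a - 0.5 * eta * s)) for a, s in zip(x1, grad_sum)]

def max_leader_shift(n=5, T=200, eta=0.1, G=1.0, seed=0):
    """Largest ||x*_{t+1} - x*_t|| observed over T rounds of random gradients of norm G."""
    rng = random.Random(seed)
    x1 = [rng.random() for _ in range(n)]
    grad_sum = [0.0] * n
    prev = leader(x1, grad_sum, eta)  # x*_1 = x1, since F_0(x) = ||x - x1||^2
    worst = 0.0
    for _ in range(T):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in g))
        g = [G * v / norm for v in g]  # rescale so that ||g|| = G
        grad_sum = [s + v for s, v in zip(grad_sum, g)]
        cur = leader(x1, grad_sum, eta)
        worst = max(worst, math.dist(prev, cur))
        prev = cur
    return worst

if __name__ == "__main__":
    print(max_leader_shift(), "<=", 0.1 * 1.0)
```

On the box the shift is in fact at most $\eta G / 2$, because clipping is non-expansive; the bound $\eta G$ from the proof holds for a general polytope.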
6.2 Analysis for strongly convex losses 

Assume all loss functions are $H$-strongly convex. For time $t \in [T]$ define

\[ \tilde{f}_t(x) = \nabla f_t(x_t)^\top x + H\|x - x_t\|^2 \quad \text{and} \quad F_t(x) = \left(\sum_{\tau=1}^t \tilde{f}_\tau(x)\right) + H T_0 \|x - x_1\|^2 \]

for some constant $T_0$ that will be determined later. Observe that $F_t(x)$ is $H(t + T_0)$-smooth and $H(t + T_0)$-strongly convex.
Denote $L = G + 2HD$.

Claim 2. For all $t \in [T]$, $\tilde{f}_t(x)$ is $L$-Lipschitz over $\mathcal{P}$.
Proof. Given two points $x, y \in \mathcal{P}$ it holds that

\begin{align*}
\tilde{f}_t(x) - \tilde{f}_t(y) &= \nabla f_t(x_t)^\top (x - y) + H\|x - x_t\|^2 - H\|y - x_t\|^2 \\
&\le G\|x - y\| + H\left(\|x - x_t\|^2 - \|y - x_t\|^2\right)
\end{align*}

Using the convexity of the function $g(x) = \|x - x_t\|^2$ we have

\[ \tilde{f}_t(x) - \tilde{f}_t(y) \le G\|x - y\| + 2H(x - x_t)^\top (x - y) \le G\|x - y\| + 2HD\|x - y\| \]

The same argument clearly holds for $\tilde{f}_t(y) - \tilde{f}_t(x)$ and thus the claim follows.

□ 
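The Lipschitz constant $L = G + 2HD$ can be spot-checked numerically. The sketch below is an illustration under assumptions that are not in the original: the domain is taken to be a Euclidean ball of diameter $D$ (only so that points with pairwise distance at most $D$ are easy to sample), the gradient is any vector of norm at most $G$, and all function names are hypothetical.

```python
import math
import random

def surrogate(x, x_t, grad, H):
    """Surrogate loss <grad, x> + H * ||x - x_t||^2."""
    linear = sum(g * a for g, a in zip(grad, x))
    quad = sum((a - b) ** 2 for a, b in zip(x, x_t))
    return linear + H * quad

def random_point_in_ball(rng, n, radius):
    """A random point of norm at most `radius` (not uniform; fine for a spot check)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(a * a for a in v))
    r = radius * rng.random()
    return [r * a / norm for a in v]

def max_lipschitz_ratio(n=4, D=2.0, G=3.0, H=0.7, trials=2000, seed=1):
    """Largest observed |f(x) - f(y)| / ||x - y|| over random pairs in a ball of diameter D."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x_t = random_point_in_ball(rng, n, D / 2)
        x = random_point_in_ball(rng, n, D / 2)
        y = random_point_in_ball(rng, n, D / 2)
        grad = random_point_in_ball(rng, n, G)  # any vector with ||grad|| <= G
        gap = math.dist(x, y)
        if gap > 1e-12:
            diff = abs(surrogate(x, x_t, grad, H) - surrogate(y, x_t, grad, H))
            worst = max(worst, diff / gap)
    return worst

if __name__ == "__main__":
    print(max_lipschitz_ratio(), "<=", 3.0 + 2 * 0.7 * 2.0)
```

The observed ratio is typically well below $G + 2HD$, since the claim is a worst-case bound.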

Lemma 8. There is a choice for the parameters a,r t ,To such that for any 
t G [T] it holds that \\x t - x* t \\ < 2 ^ k - 
Proof. The proof is similar to that of lemma 7. We prove that for any time $t \in [T]$, $F_{t-1}(x_t) - F_{t-1}(x_t^*) \le \epsilon_t$ for some $\epsilon_t > 0$, which by strong convexity implies that $\|x_t - x_t^*\| \le \sqrt{\frac{\epsilon_t}{H(t-1+T_0)}}$.

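Clearly for time $t = 1$ the claim holds since $x_1 = x_1^*$. Assume that on time $t$ it holds that $F_{t-1}(x_t) - F_{t-1}(x_t^*) \le \epsilon_t$. Thus, as we have shown,

\[ \|x_t - x_t^*\| \le \sqrt{\frac{\epsilon_t}{H(t-1+T_0)}} \tag{7} \]

It also holds that $F_t(x_t^*) - F_t(x_{t+1}^*) = F_{t-1}(x_t^*) - F_{t-1}(x_{t+1}^*) + \tilde{f}_t(x_t^*) - \tilde{f}_t(x_{t+1}^*) \le \tilde{f}_t(x_t^*) - \tilde{f}_t(x_{t+1}^*)$. By claim 2 we thus have that $F_t(x_t^*) - F_t(x_{t+1}^*) \le L\|x_t^* - x_{t+1}^*\|$. By strong convexity of $F_t(x)$ we have

\[ \|x_{t+1}^* - x_t^*\| \le \frac{L}{H(t+T_0)} \tag{8} \]

Combining (7) and (8) we have

\[ \|x_t - x_{t+1}^*\| \le \sqrt{\frac{\epsilon_t}{H(t-1+T_0)}} + \frac{L}{H(t+T_0)} \tag{9} \]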
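By induction and claim 2,

\begin{align*}
F_t(x_t) - F_t(x_{t+1}^*) &= F_{t-1}(x_t) - F_{t-1}(x_{t+1}^*) + \tilde{f}_t(x_t) - \tilde{f}_t(x_{t+1}^*) \\
&\le \epsilon_t + L\|x_t - x_{t+1}^*\| \le \epsilon_t + \frac{L\sqrt{\epsilon_t}}{\sqrt{H(t-1+T_0)}} + \frac{L^2}{H(t+T_0)}
\end{align*}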



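Setting $r_t$ to the bound in (9) and applying lemma [T] with respect to $F_t(x)$ we have

\begin{align*}
F_t(x_{t+1}) - F_t(x_{t+1}^*) &\le (1-\alpha)\left(F_t(x_t) - F_t(x_{t+1}^*)\right) + H(t+T_0)\alpha^2\rho^2 r_t^2 \\
&\le (1-\alpha)\left(\epsilon_t + \frac{L\sqrt{\epsilon_t}}{\sqrt{H(t-1+T_0)}} + \frac{L^2}{H(t+T_0)}\right) + 2H(t+T_0)\alpha^2\rho^2\left(\frac{\epsilon_t}{H(t-1+T_0)} + \frac{L^2}{H^2(t+T_0)^2}\right) \\
&\le \left(\epsilon_t + \frac{L\sqrt{\epsilon_t}}{\sqrt{H(t-1+T_0)}} + \frac{L^2}{H(t+T_0)}\right)\left(1 - \alpha + 4\alpha^2\rho^2\right)
\end{align*}

where the second inequality uses $(a+b)^2 \le 2(a^2 + b^2)$ and the third uses $\frac{t+T_0}{t-1+T_0} \le 2$, which holds for $T_0 \ge 1$.

Setting $\alpha$ so that $1 - \alpha + 4\alpha^2\rho^2 = 1 - \frac{1}{25\rho^2}$ (for example $\alpha = \frac{1}{5\rho^2}$) we get

\[ F_t(x_{t+1}) - F_t(x_{t+1}^*) \le \left(\epsilon_t + \frac{L\sqrt{\epsilon_t}}{\sqrt{H(t-1+T_0)}} + \frac{L^2}{H(t+T_0)}\right)\left(1 - \frac{1}{25\rho^2}\right) \]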



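Assume now that $\epsilon_t \le \frac{(100\rho^2 L)^2}{H(t+T_0)}$. Then we have

\begin{align*}
F_t(x_{t+1}) - F_t(x_{t+1}^*) &\le \frac{(100\rho^2 L)^2}{H(t+T_0)}\left(1 + \frac{1}{50\rho^2} + \frac{1}{(50\rho^2)^2}\right)\left(1 - \frac{1}{25\rho^2}\right) \\
&\le \frac{(100\rho^2 L)^2}{H(t+T_0)}\left(1 + \frac{1}{25\rho^2}\right)\left(1 - \frac{1}{25\rho^2}\right) \\
&= \frac{(100\rho^2 L)^2}{H(t+T_0)}\left(1 - \frac{1}{(25\rho^2)^2}\right)
\end{align*}

Finally, setting $T_0 = (25\rho^2)^2$ we have that

\[ F_t(x_{t+1}) - F_t(x_{t+1}^*) \le \frac{(100\rho^2 L)^2}{H(t+T_0)} \cdot \frac{t+T_0}{t+1+T_0} = \frac{(100\rho^2 L)^2}{H(t+1+T_0)} \]

Thus for all $t$, $F_t(x_{t+1}) - F_t(x_{t+1}^*) \le \frac{(100\rho^2 L)^2}{H(t+1+T_0)}$, and the lemma follows. □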
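The induction step of this proof can be checked numerically. The sketch below is an illustration, not part of the original argument: it takes $\epsilon_t = \frac{(100\rho^2 L)^2}{H(t+T_0)}$ with $T_0 = (25\rho^2)^2$, applies the contraction factor $1 - \frac{1}{25\rho^2}$ to the bound on $F_t(x_t) - F_t(x_{t+1}^*)$, and tests that the result is at most $\epsilon_{t+1}$. The function names and the sample values of $\rho$, $L$, $H$ are hypothetical.

```python
import math

def eps_bound(t, rho, L, H, T0):
    """Claimed bound eps_t = (100 rho^2 L)^2 / (H (t + T0))."""
    return (100.0 * rho ** 2 * L) ** 2 / (H * (t + T0))

def next_gap_bound(t, rho, L, H, T0):
    """Bound on F_t(x_{t+1}) - F_t(x*_{t+1}) obtained from eps_t via the
    contraction factor (1 - 1/(25 rho^2))."""
    e = eps_bound(t, rho, L, H, T0)
    total = e + L * math.sqrt(e) / math.sqrt(H * (t - 1 + T0)) + L ** 2 / (H * (t + T0))
    return total * (1.0 - 1.0 / (25.0 * rho ** 2))

def induction_holds(rho, L, H, horizon):
    """True if the bound at time t implies the bound at time t + 1 for t = 1..horizon."""
    T0 = (25.0 * rho ** 2) ** 2
    return all(
        next_gap_bound(t, rho, L, H, T0) <= eps_bound(t + 1, rho, L, H, T0)
        for t in range(1, horizon + 1)
    )

if __name__ == "__main__":
    print(induction_holds(rho=1.0, L=2.0, H=0.5, horizon=1000))
```

The check relies on $\rho \ge 1$ so that $T_0 \ge 1$ and the ratio $\frac{t+T_0}{t-1+T_0}$ is at most 2, as used in the proof.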

We are now ready to prove theorem 3.

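Proof. Following the lines of theorem 2 and noticing that on time $t$, $x_{t+1}^*$ is the leader with respect to the loss functions $f'_1(x) = \tilde{f}_1(x) + H T_0 \|x - x_1\|^2$ and $f'_t(x) = \tilde{f}_t(x)$ for all $t > 1$, we have that

\[ \sum_{t=1}^T \tilde{f}_t(x_{t+1}^*) - \tilde{f}_t(x^*) \le H T_0 \left(\|x^* - x_1\|^2 - \|x_2^* - x_1\|^2\right) \le H T_0 D^2 \]

By the definition of $F_t(x)$, the use of strong convexity and claim 2 we have that

\begin{align*}
Ht\|x_t^* - x_{t+1}^*\|^2 \le H(t+T_0)\|x_t^* - x_{t+1}^*\|^2 &\le F_t(x_t^*) - F_t(x_{t+1}^*) \\
&\le \tilde{f}_t(x_t^*) - \tilde{f}_t(x_{t+1}^*) \le L\|x_t^* - x_{t+1}^*\|
\end{align*}

which implies that

\begin{align*}
\sum_{t=1}^T \tilde{f}_t(x_t) - \tilde{f}_t(x^*) &= \sum_{t=1}^T \left(\tilde{f}_t(x_{t+1}^*) - \tilde{f}_t(x^*)\right) + \sum_{t=1}^T \left(\tilde{f}_t(x_t) - \tilde{f}_t(x_{t+1}^*)\right) \\
&\le H T_0 D^2 + \sum_{t=1}^T \left(\tilde{f}_t(x_t) - \tilde{f}_t(x_t^*)\right) + \sum_{t=1}^T \left(\tilde{f}_t(x_t^*) - \tilde{f}_t(x_{t+1}^*)\right) \\
&\le H T_0 D^2 + \sum_{t=1}^T L\left(\|x_t - x_t^*\| + \|x_t^* - x_{t+1}^*\|\right) \\
&\le H T_0 D^2 + \sum_{t=1}^T L\left(\|x_t - x_t^*\| + \frac{L}{Ht}\right)
\end{align*}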

Plugging the value of $T_0$ chosen in lemma 8 and the result of lemma 8 we have,

£ f t (x t ) - f t (x*) < O(HDV) + 2 f2 '-^f = o(?£ logr) 



The theorem follows from the observation that for all $t$, since $f_t(x)$ is $H$-strongly convex, it holds that

\begin{align*}
f_t(x_t) - f_t(x^*) &\le \nabla f_t(x_t)^\top (x_t - x^*) - H\|x_t - x^*\|^2 \\
&= \nabla f_t(x_t)^\top (x_t - x^*) - H\left(\|x_t - x^*\|^2 - \|x_t - x_t\|^2\right) \\
&= \tilde{f}_t(x_t) - \tilde{f}_t(x^*)
\end{align*}

□